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Abstract 

Policy search methods can allow robots to learn control policies for a wide range of 
tasks, but practical applications of policy search often require hand-engineered components 
for perception, state estimation, and low-level control. In this paper, we aim to answer 
the following question: does training the perception and control systems jointly end-to- 
end provide better performance than training each component separately? To this end, 
we develop a method that can be used to learn policies that map raw image observations 
directly to torques at the robot’s motors. The policies are represented by deep convolutional 
neural networks (CNNs) with 92,000 parameters, and are trained using a guided policy 
search method, which transforms policy search into supervised learning, with supervision 
provided by a simple trajectory-centric reinforcement learning method. We evaluate our 
method on a range of real-world manipulation tasks that require close coordination between 
vision and control, such as screwing a cap onto a bottle, and present simulated comparisons 
to a range of prior policy search methods. 

Keywords: Reinforcement Learning, Optimal Control, Vision, Neural Networks 

1. Introduction 

Robots can perform impressive tasks under human control, including surgery (Lanfranco 
et al., 2004) and household chores (Wyrobek et al., 2008). However, designing the perception 
and control software for autonomous operation remains a major challenge, even for basic 
tasks. Policy search methods hold the promise of allowing robots to automatically learn new 
behaviors through experience (Kober et al., 2010b; Deisenroth et al., 2011; Kalakrishnan 
et al., 2011; Deisenroth et al., 2013). However, policies learned using such methods often 
rely on a number of hand-engineered components for perception and control, so as to present 
the policy with a more manageable and low-dimensional representation of observations and 
actions. The vision system in particular can be complex and prone to errors, and it is 
typically not improved during policy training, nor adapted to the goal of the task. 

In this article, we aim to answer the following question: can we acquire more effec¬ 
tive policies for sensorimotor control if the perception system is trained jointly with the 
control policy, rather than separately? In order to represent a policy that performs both 
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Figure 1: Our method learns visuomotor policies that directly use camera image observa¬ 
tions (left) to set motor torques on a PR2 robot (right). 


perception and control, we use deep neural networks. Deep neural network representations 
have recently seen widespread success in a variety of domains, such as computer vision and 
speech recognition, and even playing video games. However, using deep neural networks for 
real-world sensorimotor policies, such as robotic controllers that map image pixels and joint 
angles to motor torques, presents a number of unique challenges. Successful applications of 
deep neural networks typically rely on large amounts of data and direct supervision of the 
output, neither of which is available in robotic control. Real-world robot interaction data 
is scarce, and task completion is defined at a high level by means of a cost function, which 
means that the learning algorithm must determine on its own which action to take at each 
point. From the control perspective, a further complication is that observations from the 
robot’s sensors do not provide us with the full state of the system. Instead, important state 
information, such as the positions of task-relevant objects, must be inferred from inputs 
such as camera images. 

We address these challenges by developing a guided policy search algorithm for senso¬ 
rimotor deep learning, as well as a novel CNN architecture designed for robotic control. 
Guided policy search converts policy search into supervised learning, by iteratively con¬ 
structing the training data using an efficient model-free trajectory optimization procedure. 
We show that this can be formalized as an instance of Bregman ADMM (BADMM) (Wang 
and Banerjee, 2014), which can be used to show that the algorithm converges to a locally 
optimal solution. In our method, the full state of the system is observable at training time, 
but not at test time. For most tasks, providing the full state simply requires position¬ 
ing objects in one of several known positions for each trial during training. At test time, 
the learned CNN policy can handle novel, unknown configurations, and no longer requires 
full state information. Since the policy is optimized with supervised learning, we can use 
standard methods like stochastic gradient descent for training. Our CNNs have 92,000 pa¬ 
rameters and 7 layers, including a novel spatial feature point transformation that provides 
accurate spatial reasoning and reduces overfitting. This allows us to train our policies with 
relatively modest amounts of data and only tens of minutes of real-world interaction time. 

We evaluate our method by learning policies for inserting a block into a shape sorting 
cube, screwing a cap onto a bottle, fitting the claw of a toy hammer under a nail with various 
grasps, and placing a coat hanger on a rack with a PR2 robot (see Figure 1). These tasks 
require localization, visual tracking, and handling complex contact dynamics. Our results 
demonstrate improvements in consistency and generalization from training visuomotor poli¬ 
cies end-to-end, when compared to training the vision and control components separately. 
We also present simulated comparisons that show that guided policy search outperforms a 
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number of prior methods when training high-dimensional neural network policies. Some of 
the material in this article has previously appeared in two conference papers (Levine and 
Abbeel, 2014; Levine et ah, 2015), which we extend to introduce visual input into the policy. 

2. Related Work 

Reinforcement learning and policy search methods (Gullapalli, 1990; Williams, 1992) have 
been applied in robotics for playing games such as table tennis (Kober et ah, 2010b), object 
manipulation (Gullapalli, 1995; Peters and Schaal, 2008; Kober et ah, 2010a; Deisenroth 
et ah, 2011; Kalakrishnan et ah, 2011), locomotion (Benbrahim and Franklin, 1997; Kohl 
and Stone, 2004; Tedrake et ah, 2004; Geng et ah, 2006; Endo et ah, 2008), and flight (Ng 
et ah, 2004). Several recent papers provide surveys of policy search in robotics (Deisenroth 
et ah, 2013; Kober et ah, 2013). Such methods are typically applied to one component of 
the robot control pipeline, which often sits on top of a hand-designed controller, such as 
a PD controller, and accepts processed input, for example from an existing vision pipeline 
(Kalakrishnan et ah, 2011). Our method learns policies that map visual input and joint 
encoder readings directly to the torques at the robot’s joints. By learning the entire map¬ 
ping from perception to control, the perception layers can be adapted to optimize task 
performance, and the motor control layers can be adapted to imperfect perception. 

We represent our policies with convolutional neural networks (CNNs). CNNs have a 
long history in computer vision and deep learning (Fukushima, 1980; LeCun et ah, 1989; 
Schmidhuber, 2015), and have recently gained prominence due to excellent results on a 
number of vision benchmarks (Ciresan et ah, 2011; Krizhevsky et ah, 2012; Ciresan et ah, 
2012; Girshick et ah, 2014a; Tompson et ah, 2014; LeCun et ah, 2015; He et ah, 2015). Most 
applications of CNNs focus on classification, where locational information is discarded by 
means of successive pooling layers to provide for invariance (Lee et ah, 2009). Applications 
to localization typically either use a sliding window (Girshick et ah, 2014a) or object pro¬ 
posals (Endres and Hoiem, 2010; Uijlings et ah, 2013; Girshick et ah, 2014b) to localize 
the object, reducing the task to classification, perform regression to a heatmap of manually 
labeled keypoints (Tompson et ah, 2014), requiring precise knowledge of the object posi¬ 
tion in the image and camera calibration, or use 3D models to localize previously scanned 
objects (Pepik et ah, 2012; Savarese and Fei-Fei, 2007). Many prior robotic applications of 
CNNs do not directly consider control, but employ CNNs for the perception component of 
a larger robotic system (Hadsell et ah, 2009; Sung et ah, 2015; Lenz et ah, 2015b; Pinto and 
Gupta, 2015). We use a novel CNN architecture for our policies that automatically learn 
feature points that capture spatial information about the scene, without any supervision 
beyond the information from the robot’s encoders and camera. 

Applications of deep learning in robotic control have been less prevalent in recent years 
than in visual recognition. Backpropagation through the dynamics and the image for¬ 
mation process is typically impractical, since they are often non-differentiable, and such 
long-range backpropagation can lead to extreme numerical instability, since the lineariza¬ 
tion of a suboptimal policy is likely to be unstable. This issue has also been observed 
in the related context of recurrent neural networks (Hochreiter et ah, 2001; Pascanu and 
Bengio, 2012). The high dimensionality of the network also makes reinforcement learning 
difficult (Deisenroth et ah, 2013). Pioneering early work on neural network control used 
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small, simple networks (Pomerleau, 1989; Hunt et al., 1992; Bekey and Goldberg, 1992; 
Lewis et al., 1998; Bakker et al., 2003; Mayer et ak, 2006), and has largely been supplanted 
by methods that use carefully designed policies that can be learned efficiently with rein¬ 
forcement learning (Kober et ah, 2013). More recent work on sensorimotor deep learning 
has tackled simple task-space motions (Lenz et al., 2015a; Lampe and Riedmiller, 2013) 
and used unsupervised learning to obtain low-dimensional state spaces from images (Lange 
et al., 2012). Such methods have been demonstrated on tasks with a low-dimensional un¬ 
derlying structure: Lenz et al. (2015a) controls the end-effector in 2D space, while Lange 
et al. (2012) controls a 2-dimensional slot car with 1-dimensional actions. Our experiments 
include full torque control of 7-DoF robotic arms interacting with objects, with 30-40 state 
dimensions. In simple synthetic environments, control from images has been addressed with 
image features (Jodogne and Piater, 2007), nonparametric methods (van Hoof et al., 2015), 
and unsupervised state-space learning (Bohmer et al., 2013; Jonschkowski and Brock, 2014). 
CNNs have also been trained to play video games with Q-learning, Monte Carlo tree search, 
and stochastic search (Mnih et al., 2013; Koutnfk et al., 2013; Guo et al., 2014), and have 
been applied to simple simulated control tasks (Watter et al., 2015; Lillicrap et al., 2015). 
However, such methods have only been demonstrated on synthetic domains that lack the 
visual complexity of the real world, and require an impractical number of samples for real- 
world robotic learning. Our method is sample efficient, requiring only minutes of interaction 
time. To the best of our knowledge, this is the first method that can train deep visuomotor 
policies for complex, high-dimensional manipulation skills with direct torque control. 

Learning visuomotor policies on a real robot requires handling complex observations 
and high dimensional policy representations. We tackle these challenges using guided pol¬ 
icy search. In guided policy search, the policy is optimized using supervised learning, which 
scales gracefully with the dimensionality of the policy. The training set for supervised learn¬ 
ing can be constructed using trajectory optimization under known dynamics (Levine and 
Koltun, 2013a b, 2014; Mordatch and Todorov, 2014) and trajectory-centric reinforcement 
learning methods that operate under unknown dynamics (Levine and Abbeel, 2014; Levine 
et al., 2015), which is the approach taken in this work. In both cases, the supervision is 
adapted to the policy, to ensure that the final policy can reproduce the training data. The 
use of supervised learning in the inner loop of iterative policy search has also been pro¬ 
posed in the context of imitation learning (Ross et al., 2011, 2013). However, such methods 
typically do not address the question of how the supervision should be adapted to the policy. 

The goal of our approach is also similar to visual servoing, which performs feedback 
control on feature points in a camera image (Espiau et al., 1992; Mohta et al., 2014; Wilson 
et al., 1996). However, our visuomotor policies are entirely learned from real-world data, 
and do not require feature points or feedback controllers to be specified by hand. This allows 
our method much more flexibility in choosing how to use the visual signal. Our approach 
also does not require any sort of camera calibration, in contrast to many visual servoing 
methods (though not all - see e.g. Jagersand et al. (1997); Yoshimi and Allen (1994)). 

3. Background and Overview 

In this section, we define the visuomotor policy learning problem and present an overview 
of our approach. The core component of our approach is a guided policy search algorithm 
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that separates the problem of learning visuomotor policies into separate supervised learning 
and trajectory learning phases, each of which is easier than optimizing the policy directly. 
We also discuss a policy architecture suitable for end-to-end learning of vision and control, 
and a training setup that allows our method to be applied to real robotic platforms. 

3.1 Definitions and Problem Formulation 

In policy search, the goal is to learn a policy no(ut\ot) that allows an agent to choose 
actions in response to observations o t to control a dynamical system, such as a robot. 
The policy comes from some parametric class parameterized by 0, which could be, for 
example, the weights of a neural network. The system is defined by states x^, actions 
u^, and observations o t . For example, might include the joint angles of the robot, the 
positions of objects in the world, and their time derivatives, might consist of motor 
torque commands, and o t might include an image from the robot’s onboard camera. In 
this paper, we address finite horizon episodic tasks with t E [1,.. . ,T]. The states evolve in 
time according to the system dynamics p(x^ + i|xt, u^), and the observations are, in general, 
a stochastic consequence of the states, according to p(ot|x^). Neither the dynamics nor the 
observation distribution are assumed to be known in general. For notational convenience, 
we will use 7Vo(u t \^t) to denote the distribution over actions under the policy conditioned on 
the state. However, since the policy is conditioned on the observation o^, this distribution 
is in fact given by ^(u^x*) = f 7ro(ut\ot)p(ot\^t)do t . The dynamics and ^(u^x*) together 
induce a distribution over trajectories r = {xi, ui, X 2 , U 2 ,..., xj-, ux}: 

T 

voir) = p(xi) 71-0(ut |x t )p(x t+ i |x t , u t ). 

t=1 

The goal of a task is given by a cost function ^(x*, U*), and the objective in policy search is 
to minimize the expectation E ne ^[Y2t= i ^(x*, u*)], which we will abbreviate as E 7Te ^[£(r)\. 
A summary of the notation used in the paper is provided in Table 1. 

3.2 Approach Summary 

Our methods consists of two main components, which are illustrated in Figure 3. The first is 
a supervised learning algorithm that trains policies of the form TCo(ut |o*) = J\f(p n (ot), ^(o*)), 
where both p^ipt) and FTfo t) are general nonlinear functions. In our implementation, 
p n (ot) is a deep convolutional neural network, while T^ipt) is an observation-independent 
learned covariance, though other representations are possible. The second component is a 
trajectory-centric reinforcement learning (RL) algorithm that generates guiding distribu¬ 
tions pi(u t \xt) that provide the supervision used to train the policy. These two components 
form a policy search algorithm that can be used to learn complex robotic tasks using only a 
high-level cost function ^(x*,u*). During training, only samples from the guiding distribu¬ 
tions pi(ut\*t) are generated by running rollouts on the physical system, which avoids the 
need to execute partially trained neural network policies on physical hardware. 

Supervised learning will not, in general, produce a policy with good long-horizon per¬ 
formance, since a small mistake on the part of the policy will place the system into states 
that are outside the distribution in the training data, causing compounding errors. To 
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symbol 

definition 

example / details 

x t 

Markovian system state at time step t G 
[1 ,T] 

joint angles, end-effector pose, object posi¬ 
tions, and their velocities; dimensionality: 
14 to 32 

U t 

control or action at time step t G [1,T] 

joint motor torque commands; dimensional¬ 
ity: 7 (for the PR2 robot) 

o t 

observation at time step t G [1,T] 

RGB camera image, joint encoder readings 
& velocities, end-effector pose; dimensional¬ 
ity: around 200,000 

T 

trajectory: 

r = {xi,Ui,x 2 ,u 2 ,...,xt,ut} 

notational shorthand for a sequence of states 
and actions 

£(x t ,x t ) 

cost function that defines the goal of the task 

distance between an object in the gripper 
and the target 

p(x t+ i|xt,u t ) 

unknown system dynamics 

physics that govern the robot and any ob¬ 
jects it interacts with 

p(o*|xt) 

unknown observation distribution 

stochastic process that produces camera im¬ 
ages from system state 

7To(u t \Ot) 

learned nonlinear global policy parameter¬ 
ized by weights 0 

convolutional neural network, such as the 
one in Figure 2 

7Te(ut\x t ) 

f 7T0(Ut|Ot)p(Ot|x t )dOt 

notational shorthand for observation-based 
policy conditioned on state 

Pi(u t \x t ) 

learned local time-varying linear-Gaussian 
controller for initial state x\ 

time-varying linear-Gaussian controller has 
form AT(KtiX t + k ti , C u) 

7Vo(t) 

trajectory distribution for 7re(ut\xt): 

p(x i) nLi Mu t |x t )p(x t+ i|x t , Ut) 

notational shorthand for trajectory distribu¬ 
tion induced by a policy 


Table 1: Summary of the notation frequently used in this article. 


avoid this issue, the training data must come from the policy’s own state distribution (Ross 
et ah, 2011). We achieve this by alternating between trajectory-centric RL and supervised 
learning. The RL stage adapts to the current policy providing supervision at 

states that are iteratively brought closer to the states visited by the policy. This is for¬ 
malized as a variant of the BADMM algorithm (Wang and Banerjee, 2014) for constrained 
optimization, which can be used to show that, at convergence, the policy 7Vo(ut\o t ) and the 
guiding distributions pi(u t \x t ) will exhibit the same behavior. This algorithm is derived 
in Section 4, The guiding distributions are substantially easier to optimize than learning 
the policy parameters directly (e.g., using model-free reinforcement learning), because they 
use the full state of the system x^, while the policy no(ut\o t ) only uses the observations. 
This means that the method requires the full state to be known during training, but not 
at test time. This makes it possible to efficiently learn complex visuomotor policies, but 
imposes additional assumptions on the observability of during training that we discuss 
in Section 4. 

When learning visuomotor tasks, the policy 7ro(ut\o t ) is represented by a novel convo¬ 
lutional neural network (CNN) architecture, which we describe in Section 5.2, CNNs have 
enjoyed considerable success in computer vision (LeCun et ah, 2015), but the most popular 
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RGB image convl conv2 conv3 spatial softmax feature motor 



Figure 2: Visuomotor policy architecture. The network contains three convolutional lay¬ 
ers, followed by a spatial softmax and an expected position layer that converts pixel-wise 
features to feature points, which are better suited for spatial computations. The points are 
concatenated with the robot configuration, then passed through three fully connected layers 
to produce the torques. 


architectures rely on large datasets and focus on semantic tasks such as classification, often 
intentionally discarding spatial information. Our architecture, illustrated in Figure 2, uses 
a fixed transformation from the last convolutional layer to a set of spatial feature points, 
which form a concise representation of the visual scene suitable for feedback control. Our 
network has 7 layers and around 92,000 parameters, which presents a major challenge for 
standard policy search methods (Deisenroth et ah, 2013). 


To reduce the amount of experience needed to train 
visuomotor policies, we also introduce a pretraining 
scheme that allows us to train effective policies with 
a relatively small number of iterations. The pretraining 
steps are illustrated in Figure 3. The intuition behind 
our pretraining is that, although we ultimately seek to 
obtain sensorimotor policies that combine both vision 
and control, low-level aspects of vision can be initialized 
independently. To that end, we pretrain the convolu¬ 
tional layers of our network by predicting elements of x* 
that are not provided in the observation o t , such as the 
positions of objects in the scene. We also initially train 
the guiding trajectory distributions pi(u t \x t ) indepen¬ 
dently of the convolutional network until the trajecto¬ 
ries achieve a basic level of competence at the task, and 
then switch to full guided policy search with end-to-end 
training of 7re(ut\ot). In our implementation, we also 
initialize the first layer filters from the model of Szegedy 
et al. (2014), which is trained on ImageNet (Deng et ah, 
2009) classification. The initialization and pretraining 
scheme is described in Section 5.2, 


requires robot 
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Figure 3: Diagram of our ap¬ 
proach, including the main guided 
policy search phase and initializa¬ 
tion phases. 


4. Guided Policy Search with BADMM 

Guided policy search transforms policy search into a supervised learning problem, where 
the training set is generated by a simple trajectory-centric RL algorithm. This algorithm 
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optimizes linear-Gaussian controllers ^(u^x*), and is described in Section 4.2, We refer 
to the trajectory distribution induced by pi(u t \x t ) as Pi(r). Each pi(u t \x t ) succeeds from 
different initial states. For example, in the task of placing a cap on a bottle, these initial 
states correspond to different positions of the bottle. By training on trajectories for multiple 
bottle positions, the final CNN policy can succeed from all initial states, and can generalize 
to other states from the same distribution. 

The final policy 7ro(u t \o t ) learned with guided policy search 
is only provided with observations o t of the full state x*, and 
the dynamics are assumed to be unknown. A diagram of 
this method, which corresponds to an expanded version of the 
guided policy search box in Figure 3, is shown on the right. In 
the outer loop, we draw sample trajectories {r-} for each ini¬ 
tial state on the physical system by running the corresponding 
controller pi(ut\x.t)- The samples are used to fit the dynamics 
Pi(x*+i|xt, ut) that are used to improve ^(u^x^), and serve as 
training data for the policy. The inner loop alternates between 
optimizing each Pi(r) and optimizing the policy to match these trajectory distributions. 
The policy is trained to predict the actions along each trajectory from the observations 
o^, rather than the full state x*. This allows the policy to directly use raw observations 
at test time. This alternating optimization can be framed as an instance of the BADMM 
algorithm (Wang and Banerjee, 2014), which converges to a solution where the trajectory 
distributions and the policy have the same state distribution. This allows greedy supervised 
training of the policy to produce a policy with good long-horizon performance. 

4.1 Algorithm Derivation 

Policy search methods minimize the expected cost E ne [£(t)\, where r = {xi, ui,..., xj-, u t} 
is a trajectory, and £{r) = Ylt =l u t) the cost of an episode. In the fully observed case, 

the expectation is taken under i tq{t) = p(xi) Y\[ =1 ^9( u t| x 0p( x t+i| x ^ u t)- The final policy 
7Vo(u t \o t ) is conditioned on the observations o*, but 7T0(u*|x*) can be recovered as 7r#(Lp| x*) = 
f 7T0(ut\ot)p(ot\'x.t)dot- We will present the derivation in this section for ^(u^x^), but we 
do not require knowledge of p(o t \x. t ) in the final algorithm. As discussed in Section 4.3, the 
integral will be evaluated with samples from the real system, which include both x* and o t. 
We begin by rewriting the expected cost minimization as a constrained problem: 

min E p [£(t)] s.t. p( u t |x t ) = no(u t \x t ) Vx*,u*,i, (1) 

P^e 

where we will refer to p(r) as a guiding distribution. This formulation is equivalent to the 
original problem, since the constraint forces the two distributions to be identical. However, 
if we approximate the initial state distribution p(xi) with samples x^, we can choose p(r) 
to be a class of distributions that is much easier to optimize than i r#, as we will show later. 
This will allow us to use simple local learning methods for p(r), without needing to train 
the complex neural network policy 7r${ut\ot) directly with reinforcement learning, which 
would require a prohibitive amount of experience on real physical systems. 

The constrained problem can be solved by a dual descent method, which alternates 
between minimizing the Lagrangian with respect to the primal variables, and incrementing 
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the Lagrange multipliers by their subgradient. Minimization of the Lagrangian with respect 
to p(r) and 9 is done in alternating fashion: minimizing with respect to 9 corresponds to 
supervised learning (making i tq match p(r)), and minimizing with respect to p{r) consists of 
one or more trajectory optimization problems. The dual descent method we use is based on 
BADMM (Wang and Banerjee, 2014), a variant of ADMM (Boyd et al., 2011) that augments 
the Lagrangian with a Bregman divergence between the constrained variables. We use the 
KL-divergence as the Bregman constraint, which is particularly convenient for working 
with probability distributions. We will also modify the constraint p(ut\x t ) = ^(u^x*) by 
multiplying both sides by p(x^), to get p(ut|x^)p(x^) = 7Vo(u t |x^)p(xt). This constraint is 
equivalent, but has the convenient property that we can express the Lagrangian in terms of 
expectations. The BADMM augmented Lagrangians for 9 and p are therefore given by 

T 

Q(o,p) = y ^p(xt,Ut) 5 U t)] A ^p(xt)7T0(ut|xt) [Axt,Ut] 

t= 1 
T 

A>M) = ^p(xt,ut) [^( x £ 5 u t)] A ^p(xt)7T0(ut|xt) [^Xt,Ut] 
t =1 

where A Xt , Ut is the Lagrange multiplier for state x* and action u t at time t, and (j)° t (9,p ) are 
<jJt{0,p) are expectations of the KL-divergences: 

0t(p>0) = ^(x t )[^KL(TK|x t )||7r 0 (u t |x t ))] 

= ^(x t )[ £) KL(^(u t |x t ))||p(u t |x t )]. 

Dual descent with alternating primal minimization is then described by the following steps: 

T 

6 <- arg mm E ^p(xt)7T0(ut|xt) [Ax£,uJ + Vt4>t{0lP) 

t= 1 
T 

p V- arg min E £p(x t ,u«)^( x t>ut) - A XtiUt ] + Vt$(p,6) 
p t =1 

Ax t ,u f y- A Xt ,u t + ai/t(7r 0 (ut|xt)p(x t ) -p(u t |x t )p(x t )). 

This procedure is an instance of BADMM, and therefore inherits its convergence guarantees. 
Note that we drop terms that are independent of the optimization variables on each line. 
The parameter a is a step size. As with most augmented Lagrangian methods, the weight 
vt is set heuristically, as described in Appendix A.l, 

The dynamics only affect the optimization with respect to p(r). In order to make this 
optimization efficient, we choose p{r) to be a mixture of N Gaussians Pi(r), one for each 
initial state sample x\. This makes the action conditionals pi(ut\x.t) and the dynamics 
Pi(x*+i|x£, ut) linear-Gaussian, as discussed in Section 4.2. This is a reasonable choice 
when the system is deterministic, or the noise is Gaussian or small, and we found that this 
approach is sufficiently tolerant to noise for use on real physical systems. Our choice of p 
also assumes that the policy no(ut\ot) is conditionally Gaussian. This is also reasonable, 
since the mean and covariance of 7ro(ut\ot) can be any nonlinear function of the observations 


^p(x t ,ut)[Axt,u t ] + (9, p) 

— ^p(x t ,Ut)[^X t ,Ut] 4“ Vt4>t 
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o^, which themselves are a function of the unobserved state x*. In Section 4.2, we show how 
these assumptions enable each pi(r) to be optimized very efficiently. We will refer to pi(r) 
as guiding distributions, since they serve to focus the policy on good, low-cost behaviors. 

Aside from learning p/(r), we must choose a tractable way to represent the infinite set 
of constraints p(ut\^t)p(^t) — 7 ^( 11 *|x*)p(xt). One approximate approach proposed in prior 
work is to replace the exact constraints with expectations of features (Peters et ah, 2010). 
When the features consist of linear, quadratic, or higher order monomial functions of the 
random variable, this can be viewed as a constraint on the moments of the distributions. If 
we only use the first moment, we get a constraint on the expected action: E p ( Ut | Xt ) p ( Xt )[ u t] = 
£ , 7r0 (u t |x t )p(xt)[ u t]- If the stochasticity in the dynamics is low, as we assumed previously, the 
optimal solution for each Pi(r) will have low entropy, making this first moment constraint 
a reasonable approximation. The KL-divergence terms in the augmented Lagrangians will 
still serve to softly enforce agreement between the higher moments. While this simplification 
is quite drastic, we found that it was more stable in practice than including higher moments, 
likely because these higher moments are harder to estimate accurately with a limited number 
of samples. The alternating optimization is now given by 

T 

6 <- arg mm E -^p(xt)7T0(llt|xt) [ U t / V^] (2) 

t=1 
T 

p arg min E E p(xt,u t)[^( x t, u <) - Vl + ^(p, 0) (3) 

P t=1 

^fit + C ^ z/ t(^7r 0 (ut|xt)p(xt) [ u t] — -^p(ut|xt)p(xt) [ u t])? 

where Ais the Lagrange multiplier on the expected action at time t. In the rest of 
the paper, we will use Cq(6,p) and C p (p, 9) to denote the two augmented Lagrangians in 
Equations (2) and (3), respectively. In the next two sections, we will describe how £ p (p,9) 
can be optimized with respect to p under unknown dynamics, and how Cq(6,p) can be 
optimized for complex, high-dimensional policies. Implementation details of the BADMM 
optimization are presented in Appendix A.l, 

4.2 Trajectory Optimization under Unknown Dynamics 

Since the Lagrangian C p (p,9) in the previous section factorizes over the mixture elements 
in p(r) — iPi ( T )> we describe the trajectory optimization method for a single Gaussian 
p(r). When there are multiple mixture elements, this procedure is applied in parallel to each 
Pi(r). Since p(r) is Gaussian, the conditionals p(xt+i|x*, u^) andp(u^x^), which correspond 
to the dynamics and the controller, are time-varying linear-Gaussian, and given by 

p(u t |x f ) = J\f(K t x t + kj, Ct) p(x t + i|x t , Ut) = V(/x4Xf + / u tUt + f c t, Ft). 

This type of controller can be learned efficiently with a small number of real-world samples, 
making it a good choice for optimizing the guiding distributions. Since a different set of time- 
varying linear-Gaussian dynamics is fitted for each initial state, this dynamics representation 
can model any continuous deterministic system that can be locally linearized. Stochastic 
dynamics can violate the local linearity assumption in principle, but we found that in 
practice this representation was well suited for a wide variety of noisy real-world tasks. 
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The dynamics are determined by the environment. If they are known, p(ut\x.t) can be 
optimized with a variant of the iterative linear-quadratic-Gaussian regulator (iLQG) (Li and 
Todorov, 2004; Levine and Koltun, 2013a), which is a variant of DDP (Jacobson and Mayne, 
1970). In the case of unknown dynamics, we can fit p(x^ + i|x^, Ut) to sample trajectories 
sampled from the trajectory distribution at the previous iteration, denoted p(r). If p(r) is 
too different from p(r), these samples will not give a good estimate of p(xt+i|xt, u^), and 
the optimization will diverge. To avoid this, we can bound the change from p(r) to p(r) in 
terms of their KL-divergence by a step size e, producing the following constrained problem: 

, min, £ P (Pi°) s -t- D kl {p(t)\\p(t)) < e. 

This type of policy update has previously been proposed by several authors in the con¬ 
text of policy search (Bagnell and Schneider, 2003; Peters and Schaal, 2008; Peters et ah, 
2010; Levine and Abbeel, 2014). In the case when p{r) is Gaussian, this problem can be 
solved efficiently using dual gradient descent, while the dynamics p(x^+i|x^, ut) are fitted 
to samples gathered by running the previous controller p(u^xt) on the robot. Fitting a 
global Gaussian mixture model to tuples (x*, u^, xt+i) and using it as a prior for fitting the 
dynamics p(x*+i|xt, Ut) serves to greatly reduce the sample complexity. We describe the 
dynamics fitting procedure in detail in Appendix A.3, 

Note that the trajectory optimization cost function £ p (p,6) also depends on the policy 
tyq (u.£|x^), while we only have access to Tr^u^o*). In order to compute a local quadratic 
expansion of the KL-divergence term DKh(p(^t\'^t)\\'^e( u t\' x -t)) inside £ p (p, 9) for iLQG, we 
also estimate a linearization of the mean of the conditionally Gaussian policy 7Vo(ut\o t ) with 
respect to the state x*, using the same procedure that we use to linearize the dynamics. The 
data for this estimation consists of tuples {x£, i?^ Ut | 0 |)[ut]}, which we can obtain because 
both the states xj and the observations o\ are available for all of the samples evaluated on 
the real physical system. 

This constrained optimization is performed in the “inner loop” of the optimization 
described in the previous section, and the KL-divergence constraint Dkl(p(t)\\ p(r)) < e 
imposes a step size on the trajectory update. The overall algorithm then becomes an 
instance of generalized BADMM (Wang and Banerjee, 2014). Note that the augmented 
Lagrangian £ p (p, 0) consists of an expectation under p(r) of a quantity that is independent of 
p. We can locally approximate this quantity with a quadratic by using a quadratic expansion 
of ^(xj, ut), and fitting a linear-Gaussian to 7r#(ut|xt) with the same method we used for the 
dynamics. We can then solve the primal optimization in the dual gradient descent procedure 
with a standard LQR backward pass. This is significantly simpler and much faster than 
the forward-backward dynamic programming procedure employed in previous work (Levine 
and Abbeel, 2014; Levine and Koltun, 2014). This improvement is enabled by the use of 
BADMM, which allows us to always formulate the KL-divergence term in the Lagrangian 
with the distribution being optimized as the first argument. Since the KL-divergence is 
convex in its first argument, this makes the corresponding optimization significantly easier. 
The details of this LQR-based dual gradient descent algorithm are derived in Appendix A.4. 

We can further improve the efficiency of the method by allowing samples from multiple 
trajectories Pi(r) to be used to fit a shared dynamics p(x*+i|x*, ut), while the controllers 
Pi(ut\x.t) are allowed to vary. This makes sense when the initial states of these trajectories 
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are similar, and they therefore visit similar regions. This allows us to draw just a single 
sample from each Pi(r) at each iteration, allowing us to handle many more initial states. 

4.3 Supervised Policy Optimization 

Since the policy parameters 6 participate only in the constraints of the optimization problem 
in Equation (1), optimizing the policy corresponds to minimizing the KL-divergence between 
the policy and trajectory distribution, as well as the expectation of A^u^. For a conditional 
Gaussian policy of the form 7ro(u t \o t ) = J\T (p 77 (o t ), E^o t )), the objective is 

N T 

^(^p)=^EE^(x t ,c,)[tr[C t -: 1 S 7r (o t )]-log|S-(o t )| 

i=l t—1 

+(M 7r (°t)-^i(xt))C^ 1 (M 7r (ot)— /4(x t )) + 2X^ 7T (o t )\ , 

where /i^(x^) is the mean ofpi(u*|x*) and Cu is the covariance, and the expectation is eval¬ 
uated using samples from each Pi(r) with corresponding observations o t . The observations 
are sampled from p(o^|x^) by recording camera images on the real system. Since the input 
to ii*{pt) and E 7r (ot) is not the state x^, but only an observation owe can train the policy 
to directly use raw observations. Note that C$(0,p) is simply a weighted quadratic loss on 
the difference between the policy mean and the mean action of the trajectory distribution, 
offset by the Lagrange multiplier. The weighting is the precision matrix of the conditional 
in the trajectory distribution, which is equal to the curvature of its cost-to-go function 
(Levine and Koltun, 2013a). This has an intuitive interpretation: Cq{0,p) penalizes de¬ 
viation from the trajectory distribution, with a penalty that is locally proportional to its 
cost-to-go. At convergence, when the policy 7vo(u t \ot) takes the same actions as p^(u^|x^), 
their Q-functions are equal, and the supervised policy objective becomes equivalent to the 
policy iteration objective (Levine and Koltun, 2014) 

In this work, we optimize C$(0,p) with respect to 6 using stochastic gradient descent 
(SGD), a standard method for neural network training. The covariance of the Gaussian 
policy does not depend on the observation in our implementation, though adding this de¬ 
pendence would be straightforward. Since training complex neural networks requires a 
substantial number of samples, we found it beneficial to include sampled observations from 
previous iterations into the policy optimization, evaluating the action /i^(x^) at their corre¬ 
sponding states using the current trajectory distributions. Since these samples come from 
the wrong state distribution, we use importance sampling and weight them according to 
the ratio of their probability under the current distribution p(x*) and the one they were 
sampled from, which is straightforward to evaluate under the estimated linear-Gaussian 
dynamics (Levine and Koltun, 2013b). 

4.4 Comparison with Prior Guided Policy Search Methods 

We presented a guided policy search method where the policy is trained on observations, 
while the trajectories are trained on the full state. The BADMM formulation of guided 
policy search is new to this work, though several prior guided policy search methods based 
on constrained optimization have been proposed. Levine and Koltun (2014) proposed a 
formulation similar to Equation (1), but with a constraint on the KL-divergence between 
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p(r) and ir$. This results in a more complex, non-convex forward-backward trajectory 
optimization phase. Since the BADMM formulation solves a convex problem during the 
trajectory optimization phase, it is substantially faster and easier to implement and use, 
especially when the number of trajectories pfir) is large. 

The use of ADMM for guided policy search was also proposed by Mordatch and Todorov 
(2014) for deterministic policies under known dynamics. This approach requires known, de¬ 
terministic dynamics and trains deterministic policies. Furthermore, because this approach 
uses a simple quadratic augmented Lagrangian term, it further requires penalty terms on 
the gradient of the policy to account for local feedback. Our approach enforces this feed¬ 
back behavior due to the higher moments included in the KL-divergence term, but does not 
require computing the second derivative of the policy. 

5. End-to-End Visuomotor Policies 

Guided policy search allows us to optimize complex, high-dimensional policies with raw 
observations, such as when the input to the policy consists of images from a robot’s onboard 
camera. However, leveraging this capability to directly learn policies for visuomotor control 
requires designing a policy representation that is both data-efficient and capable of learning 
complex control strategies directly from raw visual inputs. In this section, we describe 
a deep convolutional neural network (CNN) model that is uniquely suited to this task. 
Our approach combines a novel spatial soft-argmax layer with a pretraining procedure that 
provides for flexibility and data-efficiency. 

5.1 Visuomotor Policy Architecture 

Our visuomotor policy runs at 20 Hz on the robot, mapping monocular RGB images and 
the robot configurations to joint torques on a 7 DoF arm. The configuration includes the 
angles of the joints and the pose of the end-effector (defined by 3 points in the space of the 
end-effector), as well as their velocities, but does not include the position of the target ob¬ 
ject or goal, which must be determined from the image. CNNs often use pooling to discard 
the locational information that is necessary to determine positions, since it is an irrelevant 
distractor for tasks such as object classification (Lee et ah, 2009). Because locational in¬ 
formation is important for control, our policy does not use pooling. Additionally, CNNs 
built for spatial tasks such as human pose estimation often also rely on the availability of 
location labels in image-space, such as hand-labeled keypoints (Tompson et ah, 2014). We 
propose a novel CNN architecture capable of estimating spatial information from an image 
without direct supervision in image space. Our pose estimation experiments, discussed in 
Section 5.2, show that this network can learn useful visual features using only 3D position 
information provided by the robot, and no camera calibration. Further training the network 
with guided policy search to directly output motor torques causes it to acquire task-specific 
visual features. Our experiments in Section 6.4 show that this improves performance beyond 
the level achieved with features trained only for pose estimation. 

Our network architecture is shown in Figure 2, The visual processing layers of the 
network consist of three convolutional layers, each of which learns a bank of filters that 
are applied to patches centered on every pixel of its input. These filters form a hierarchy 
of local image features. Each convolutional layer is followed by a rectifying nonlinearity of 
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the form a c ij = max(0, z c {f) for each channel c and each pixel coordinate The third 

convolutional layer contains 32 response maps with resolution 109 x 109. These response 
maps are passed through a spatial softmax function of the form s c ij — e acij e aci ' j '- 
Each output channel of the softmax is a probability distribution over the location of a 
feature in the image. To convert from this distribution to a coordinate representation 
( fcx,fcy ), the network calculates the expected image position of each feature, yielding a 
2D coordinate for each channel: f cx = Y,ij s aj^ij and f cy = where (xij,Vij) 

is the image-space position of the point (i, j) in the response map. Since this is a linear 
operation, it corresponds to a fixed, sparse fully connected layer with weights W c { x — X{j and 
W c j y = yij. The combination of the spatial softmax and expectation operator implement a 
kind of soft-argmax. The spatial feature points ( f cx , fcy ) are concatenated with the robot’s 
configuration and fed into two fully connected layers, each with 40 rectified units, followed 
by linear connections to the torques. The full network contains about 92,000 parameters, 
of which 86,000 are in the convolutional layers. 

The spatial softmax and the expected position computation serve to convert pixel-wise 
representations in the convolutional layers to spatial coordinate representations, which can 
be manipulated by the fully connected layers into 3D positions or motor torques. The 
softmax also provides lateral inhibition, which suppresses low, erroneous activations, only 
keeping strong activations that are more likely to be accurate. This makes our policy 
more robust to distractors, providing generalization to novel visual variation. We compare 
our architecture with more standard alternatives in Section 6.3 and evaluate robustness to 
visual distractors in Section 6.4, However, the proposed architecture is also in some sense 
more specialized for visuomotor control, in contrast to more general standard convolutional 
networks. For example, not all perception tasks require information that can be coherently 
summarized by a set of spatial locations. 


5.2 Visuomotor Policy Training 

The guided policy search trajectory optimization phase uses the 
full state of the system, though the final policy only uses the 
observations. This type of instrumented training is a natural 
choice for many robotics tasks, where the robot is trained under 
controlled conditions, but must then act intelligently in uncon¬ 
trolled, real-world situations. In our tasks, the unobserved vari¬ 
ables are the pose of a target object (e.g. the bottle on which a 
cap must be placed). During training, this target object is typi¬ 
cally held in the robot’s left gripper, while the robot’s right arm 
performs the task, as shown to the right. This allows the robot 
to move the target through a range of known positions. The 
final visuomotor policy does not receive this position as input, 
but must instead use the camera images. Due to the modest 
amount of training data, distractors that are correlated with task-relevant variables can 
hamper generalization. For this reason, the left arm is covered with cloth to prevent the 
policy from associating its appearance with the object’s position. 
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While we can train the visuomotor policy entirely from scratch, the algorithm would 
spend a large number of iterations learning basic visual features and arm motions that 
can more efficiently be learned by themselves, before being incorporated into the policy. 
To speed up learning, we initialize both the vision layers in the policy and the trajectory 
distributions for guided policy search by leveraging the fully observed training setup. To 
initialize the vision layers, the robot moves the target object through a range of random 
positions, recording camera images and the object’s pose, which is computed automatically 
from the pose of the gripper. This dataset is used to train a pose regression CNN, which 
consists of the same vision layers as the policy, followed by a fully connected layer that 
outputs the 3D points that define the target. Since the training set is still small (we use 
1000 images collected from random arm motions), we initialize the filters in the first layer 
with weights from the model of Szegedy et al. (2014), which is trained on ImageNet (Deng 
et al., 2009) classification. After training on pose regression, the weights in the convolutional 
layers are transferred to the policy CNN. This enables the robot to learn the appearance of 
the objects prior to learning the behavior. 


To initialize the linear-Gaussian controllers for each of the initial states, we take 15 
iterations of guided policy search without optimizing the visuomotor policy. This allows 
for much faster training in the early iterations, when the trajectories are not yet successful, 
and optimizing the full visuomotor policy is unnecessarily time consuming. Since we still 
want the trajectories to arrive at compatible strategies for each target position, we replace 
the visuomotor policy during these iterations with a small network that receives the full 
state, which consisted of two layers with 40 rectified linear hidden units in our experiments. 
This network serves only to constrain the trajectories and avoid divergent behaviors from 
emerging for similar initial states, which would make subsequent policy learning difficult. 


After initialization, we train the full visuomotor policy with 
guided policy search. During the supervised policy optimiza¬ 
tion phase, the fully connected motor control layers are first 
optimized by themselves, since they are not initialized with pre¬ 
training. This can be done very quickly because these layers are 
small. Then, the entire network is further optimized end-to-end. 

We found that first training the upper layers before end-to-end 
optimization prevented the convolutional layers from forgetting 
useful features learning during pretraining, when the error sig¬ 
nal due to the untrained upper layers is very large. The entire 
pretraining scheme is summarized in the diagram on the right. 

Note that the trajectories can be pretrained in parallel with the 
vision layer pretraining, which does not require access to the physical system. Further¬ 
more, the entire initialization procedure does not use any additional information that is not 
already available from the robot. 



policy 


6. Experimental Evaluation 

In this section, we present a series of experiments aimed at evaluating our approach and 
answering the following questions: 
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1. How does the guided policy search algorithm compare to other policy search methods 
for training complex, high-dimensional policies, such as neural networks? 

2. Does our trajectory optimization algorithm work on a real robotic platform with 
unknown dynamics, for a range of different tasks? 

3. How does our spatial softmax architecture compare to other, more standard convolu¬ 
tional neural network architectures? 

4. Does training the perception and control systems in a visuomotor policy jointly end- 
to-end provide better performance than training each component separately? 

Evaluating a wide range of policy search algorithms on a real robot would be extremely 
time consuming, particularly for methods that require a large number of samples. We 
therefore answer question (1) by using a physical simulator and simpler policies that do not 
use vision. This also allows us to test the generality of guided policy search on tasks that 
include manipulation, walking, and swimming. To answer question (2), we present a wide 
range of experiments on a PR2 robot. These experiments allow us to evaluate the sample 
efficiency of our trajectory optimization algorithm. To address question (3), we compare 
a range of different policy architectures on the task of localizing a target object (the cube 
in the shape sorting cube task). Since localizing the target object is a prerequisite for 
completing the shape sorting cube task, this serves as a good proxy for evaluating different 
architectures. Finally, we answer the last and most important question (4) by training 
visuomotor policies for hanging a coat hanger on a clothes rack, inserting a block into a 
shape sorting cube, fitting the claw of a toy hammer under a nail with various grasps, and 
screwing on a bottle cap. These tasks are illustrated in Figure 8, 

6.1 Simulated Comparisons to Prior Policy Search Methods 

In this section, we compare our method against prior policy search techniques on a range of 
simulated robotic control tasks. These results previously appeared in our conference paper 
that introduced the trajectory optimization procedure with local linear models (Levine and 
Abbeel, 2014). In these tasks, the state consists of the joint angles and velocities of each 
robot, and the actions tp consist of the torques at each joint. The neural network policies 
used one hidden layer and soft rectifier nonlinearities of the form a = log(l + exp(z)). Since 
these policies use the state as input, they only have a few hundred parameters, far fewer 
than our visuomotor policies. However, even this number of parameters can pose a major 
challenge for prior policy search methods (Deisenroth et ah, 2013). 

Experimental tasks. We simulated 2D and 3D peg insertion, octopus arm control, and 
planar swimming and walking. The difficulty in the peg insertion tasks stems from the need 
to align the peg with the slot and the complex contacts between the peg and the walls, which 
result in discontinuous dynamics. Octopus arm control involves moving the tip of a flexible 
arm to a goal position (Engel et ah, 2005). The challenge in this task stems from its high 
dimensionality: the arm has 25 degrees of freedom, corresponding to 50 state dimensions. 
The swimming task requires controlling a three-link snake, and the walking task requires 
a seven-link biped to maintain a target velocity. The challenge in these tasks comes from 
underactuation. Details of the simulation and cost for each task are in Appendix B.l, 
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Figure 4: Results for learning linear-Gaussian controllers for 2D and 3D insertion, octopus 
arm, and swimming. Our approach uses fewer samples and finds better solutions than prior 
methods, and the GMM further reduces the required sample count. Images in the lower- 
right show the last time step for each system at several iterations of our method, with red 
lines indicating end effector trajectories. 

Prior methods. We compare to REPS (Peters et ah, 2010), reward-weighted regression 
(RWR) (Peters and Schaal, 2007; Kober and Peters, 2009), the cross-entropy method (CEM) 
(Rubinstein and Kroese, 2004), and PILCO (Deisenroth and Rasmussen, 2011). We also 
use iLQG (Li and Todorov, 2004) with a known model as a baseline, shown as a black 
horizontal line in all plots. REPS is a model-free method that, like our approach, enforces 
a KL-divergence constraint between the new and old policy. We compare to a variant 
of REPS that also fits linear dynamics to generate 500 pseudo-samples (Lioutikov et ah, 
2014), which we label “REPS (20 + 500).” RWR is an EM algorithm that fits the policy 
to previous samples weighted by the exponential of their reward, and CEM fits the policy 
to the best samples in each batch. With Gaussian trajectories, CEM and RWR only differ 
in the weights. These methods represent a class of RL algorithms that fit the policy to 
weighted samples, including PoWER and PI2 (Kober and Peters, 2009; Theodorou et ah, 
2010; Stulp and Sigaud, 2012). PILCO is a model-based method that uses a Gaussian 
process to learn a global dynamics model that is used to optimize the policy. We used the 
open-source implementation of PILCO provided by the authors. Both REPS and PILCO 
require solving large nonlinear optimizations at each iteration, while our method does not. 
Our method used 5 rollouts with the Gaussian mixture model prior, and 20 without. Due 
to its computational cost, PILCO was provided with 5 rollouts per iteration, while other 
prior methods used 20 and 100. For all prior methods with free hyperparameters (such as 
the fraction of elites for CEM), we performed hyperparameter sweeps and chose the most 
successful settings for the comparison. 

Gaussian trajectory distributions. In the first set of comparisons, we evaluate only the 
trajectory optimization procedure for training linear-Gaussian controllers under unknown 
dynamics to determine its sample-efficiency and applicability to complex, high-dimensional 
problems. The results of this comparison for the peg insertion, octopus arm, and swimming 
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Figure 5: Comparison on neural network policies. For insertion, the policy was trained to 
search for an unknown slot position on four slot positions (shown above). Generalization 
to new positions is graphed with dashed lines. Note how the end effector (red) follows the 
surface to find the slot, and how the swimming gait is smoother due to the stationary policy. 


tasks appears in Figure 4. The horizontal axis shows the total number of samples, and 
the vertical axis shows the minimum distance between the end of the peg and the bottom 
of the slot, the distance to the target for the octopus arm, or the total distance travelled 
by the swimmer. Since the peg is 0.5 units long, distances above this amount correspond 
to controllers that cannot perform an insertion. Our method learned much more effective 
controllers with fewer samples, especially when using the Gaussian mixture model prior. 
On 3D insertion, it outperformed the iLQG baseline, which used a known model. Contact 
discontinuities cause problems for derivative-based methods like iLQG, as well as methods 
like PILCO that learn a smooth global dynamics model. We use a time-varying local 
model, which preserves more detail, and fitting the model to samples has a smoothing effect 
that mitigates discontinuity issues. Prior policy search methods could servo to the hole, 
but were unable to insert the peg. On the octopus arm, our method succeeded despite 
the high dimensionality of the state and action spaces. 1 Our method also successfully 
learned a swimming gait, while prior model-free methods could not initiate forward motion. 
PILCO also learned an effective gait due to the smooth dynamics of this task, but its GP- 
based optimization required orders of magnitude more computation time than our method, 
taking about 50 minutes per iteration. In the case of prior model-free methods, the high 
dimensionality of the time-varying linear-Gaussian controllers likely caused considerable 
difficulty (Deisenroth et ah, 2013), while our approach exploits the structure of linear- 
Gaussian controllers for efficient learning. 


1. The high dimensionality of the octopus arm made it difficult to run PILCO, though in principle, such 
methods should perform well on this task given the arm’s smooth dynamics. 
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Neural network policies. In the second set of comparisons, shown in Figure 5, we 
compare guided policy search to RWR and CEM 2 on the challenging task of training high¬ 
dimensional neural network policies for the peg insertion and locomotion tasks. The vari¬ 
ant of guided policy search used in this comparison differs somewhat from the version 
described in Section 4, in that it used a simpler dual gradient descent formulation, rather 
than BADMM. In practice, we found the performance of these methods to be very similar, 
though the BADMM variant was substantially faster and easier to implement. 

On swimming, our method achieved similar performance to the linear-Gaussian case, 
but since the neural network policy was stationary, the resulting gait was much smoother. 
Previous methods could only solve this task with 100 samples per iteration, with RWR 
eventually obtaining a distance of 0.5m after 4000 samples, and CEM reaching 2.1m after 
3000. Our method was able to reach such distances with many fewer samples. Following 
prior work (Levine and Koltun, 2013a), the walker trajectory was initialized from a demon¬ 
stration, which was stabilized with simple linear feedback. The RWR and CEM policies 
were initialized with samples from this controller to provide a fair comparison. The graph 
shows the average distance travelled on rollouts that did not fall, and shows that only our 
method was able to learn walking policies that succeeded consistently. 

On peg insertion, the neural network was trained to insert the peg without precise 
knowledge of the position of the hole, resulting in a partially observed problem. The holes 
were placed in a region of radius 0.2 units in 2D and 0.1 units in 3D. The policies were 
trained on four different hole positions, and then tested on four new hole positions to 
evaluate generalization. The hole position was not provided to the neural network, and the 
policies therefore had to search for the hole, with only joint angles and velocities as input. 
Only our method could acquire a successful strategy to locate both the training and test 
holes, although RWR was eventually able to insert the peg into one of the four holes in 2D. 

These comparisons show that training even medium-sized neural network policies for 
continuous control tasks with a limited number of samples is very difficult for many prior 
policy search algorithms. Indeed, it is generally known that model-free policy search meth¬ 
ods struggle with policies that have over 100 parameters (Deisenroth et ah, 2013). In 
subsequent sections, we will evaluate our method on real robotic tasks, showing that it can 
scale from these simulated tasks all the way up to end-to-end learning of visuomotor control. 

6.2 Learning Linear-Gaussian Controllers on a PR2 Robot 

In this section, we demonstrate the range of manipulation tasks that can be learned using 
our trajectory optimization algorithm on a real PR2 robot. These experiments previously 
appeared in our conference paper on guided policy search (Levine et ah, 2015). Since 
performing trajectory optimization is a prerequisite for guided policy search to learn effective 
visuomotor policies, it is important to evaluate that our trajectory optimization can learn 
a wide variety of robotic manipulation tasks under unknown dynamics. The tasks in these 
experiments are shown in Figure 6, while Figure 7 shows the learning curves for each task. 
For all robotic experiments in this article, the tasks were learned entirely from scratch, 

2. PILCO cannot optimize neural network policies, and we could not obtain reasonable results with REPS. 

Prior applications of REPS generally focus on simpler, lower-dimensional policy classes (Peters et ah, 

2010; Lioutikov et ah, 2014). 
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Figure 6: Tasks for linear-Gaussian controller evaluation: (a) stacking lego blocks on a fixed 
base, (b) onto a free-standing block, (c) held in both gripper; (d) threading wooden rings 
onto a peg; (e) attaching the wheels to a toy airplane; (f) inserting a shoe tree into a shoe; 
(g,h) screwing caps onto pill bottles and (i) onto a water bottle. 


linear-Gaussian controller learning curves 


lego block (fixed) 
lego block (free) 
lego block (hand) 
ring on peg 


■ toy airplane 
- shoe tree 

pill bottle 

■ water bottle 



Figure 7: Distance to target point during 
training of linear-Gaussian controllers. The 
actual target may differ due to perturbations. 
Error bars indicate one standard deviation. 


with the initialization of the controllers 
p(ut\*t) described in Appendix B.2, The 
number of samples required to learn each 
controller is around 20-25, substantially 
lower than many prior policy search meth¬ 
ods in the literature (Peters and Schaal, 

2008; Kober et ah, 2010b; Theodorou et ah, 

2010; Deisenroth et ah, 2013). Total learn¬ 
ing time was about ten minutes for each 
task, of which only 3-4 minutes involved sys¬ 
tem interaction. The rest of the time was 
spent resetting the robot to the initial state 
and on computation. 

The linear-Gaussian controllers are optimized for a specific condition - e.g., a specific 
position of the target lego block. To evaluate their robustness to errors in the specified 
target position, we conducted experiments on the lego block and ring tasks where the target 
object (the lower block and the peg) was perturbed at each trial during training, and then 
tested with various perturbations. For each task, controllers were trained with Gaussian 
perturbations with standard deviations of 0, 1, and 2 cm in the position of the target object, 
and each controller was tested with perturbations of radius 0, 1, 2, and 3 cm. Note that 
with a radius of 2 cm, the peg would be placed about one ring-width away from the expected 
position. The results are shown in Table 2, All controllers were robust to perturbations of 
1 cm, and would often succeed at 2 cm. Robustness increased slightly when more noise was 
injected during training, but even controllers trained without noise exhibited considerable 
robustness, since the linear-Gaussian controllers themselves add noise during sampling. We 
also evaluated a kinematic baseline for each perturbation level, which planned a straight 
path from a point 5 cm above the target to the expected (unperturbed) target location. This 
baseline was only able to place the lego block in the absence of perturbations. The rounded 
top of the peg provided an easier condition for the baseline, with occasional successes at 
higher perturbation levels. Our controllers outperformed the baseline by a wide margin. 

All of the robotic experiments discussed in this section may be viewed in the corre¬ 
sponding supplementary video, available online: http: //rll. berkeley. edu/icra2015gps. 
A video illustration of the visuomotor policies, discussed in the following sections, is also 
available: http: //sites. google. com/site/visuomotorpolicy, 
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test perturbation 

lego block 

ring on peg 

0 cm 

1 cm 

2 cm 

3 cm 

0 cm 

1 cm 

2 cm 

3 cm 

hJO 

0 cm 

5/5 

5/5 

3/5 

2/5 

5/5 

5/5 

0/5 

0/5 

fH 

'd 2 

1 cm 

5/5 

5/5 

3/5 

2/5 

5/5 

5/5 

3/5 

0/5 

.g +■= 

2 cm 

5/5 

5/5 

5/5 

3/5 

5/5 

5/5 

3/5 

0/5 

b Qh 

kinematic baseline 

5/5 

0/5 

0/5 

0/5 

5/5 

3/5 

0/5 

0/5 


Table 2: Success rates of linear-Gaussian controllers under target object perturbation. 

6.3 Spatial Softmax CNN Architecture Evaluation 

In this section, we evaluate the neural network architecture that we propose in Section 5.1 
in comparison to more standard convolutional networks. To isolate the architectures from 
other confounding factors, we measure their accuracy on the pose estimation pretraining 
task described in Section 5.2, This is a reasonable proxy for evaluating how well the network 
can overcome two major challenges in visuomotor learning: the ability to handle relatively 
small datasets without overfitting, and the capability to learn tasks that are inherently 
spatial. We compare to a network where the expectation operator after the softmax is 
replaced with a learned fully connected layer, as is standard in the literature, a network 
where both the softmax and the expectation operators are replaced with a fully connected 
layer, and a version of this network that also uses 3x3 max pooling with stride 2 at the 
first two layers. These alternative architectures have many more parameters, since the fully 
connected layer takes the entire bank of response maps from the third convolutional layer 
as input. Pooling helps to reduce the number of parameters, but not to the same degree as 
the spatial softmax and expectation operators in our architecture. 

The results in Table 3 indicate that using the softmax and expectation operators im¬ 
proves pose estimation accuracy substantially. Our network is able to outperform the more 
standard architectures because it is forced by the softmax and expectation operators to 
learn feature points, which provide a concise representation suitable for spatial inference. 
Since most of the parameters in this architecture are in the convolutional layers, which ben¬ 
efit from extensive weight sharing, overfitting is also greatly reduced. By removing pooling, 
our network also maintains higher resolution in the convolutional layers, improving spatial 
accuracy. Although we did attempt to regularize the larger standard architectures with 
higher weight decay and dropout, we did not observe a significant improvement on this 
dataset. We also did not extensively optimize the parameters of this network, such as fil¬ 
ter size and number of channels, and investigating these design decisions further would be 
valuable to investigate in future work. 


network architecture 

test error (cm) 

softmax + feature points (ours) 

1.30 ± 0.73 

softmax + fully connected layer 

2.59 ± 1.19 

fully connected layer 

4.75 ± 2.29 

max-pooling + fully connected 

3.71 ± 1.73 


Table 3: Average pose estimation accuracy and standard deviation with various architec¬ 
tures, measured as average Euclidean error for the three target points in 3D, with ground 
truth determined by forward kinematics from the left arm. 
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(a) hanger (b) cube (c) hammer (d) bottle 

Figure 8: Illustration of the tasks in our visuomotor policy experiments, showing the vari¬ 
ation in the position of the target for the hanger, cube, and bottle tasks, as well as two of 
the three grasps for the hammer, which also included variation in position (not shown). 

6.4 Deep Visuomotor Policy Evaluation 

In this section, we present an evaluation of our full visuomotor policy training algorithm on 
a PR2 robot. The aim of this evaluation is to answer the following question: does training 
the perception and control systems in a visuomotor policy jointly end-to-end provide better 
performance than training each component separately? 

Experimental tasks. We trained policies for hanging a coat hanger on a clothes rack, 
inserting a block into a shape sorting cube, fitting the claw of a toy hammer under a nail with 
various grasps, and screwing on a bottle cap. The cost function for these tasks encourages 
low distance between three points on the end-effector and corresponding target points, low 
torques, and, for the bottle task, spinning the wrist. The equations for these cost functions 
and the details of each task are presented in Appendix B.2, The tasks are illustrated in 
Figure 8, Each task involved variation of 10-20 cm in each direction in the position of the 
target object (the rack, shape sorting cube, nail, and bottle). In addition, the coat hanger 
and hammer tasks were trained with two and three grasps, respectively. The current angle 
of the grasp was not provided to the policy, but had to be inferred from observing the 
robot’s gripper in the camera images. All tasks used the same policy architecture and 
model parameters. 

Experimental conditions. We evaluated the visuomotor policies in three conditions: (1) 
the training target positions and grasps, (2) new target positions not seen during training 
and, for the hammer, new grasps (spatial test), and (3) training positions with visual 
distractors (visual test). A selection of these experiments is shown in the supplementary 
video. 3 For the visual test, the shape sorting cube was placed on a table rather than held in 

3. The video can be viewed at http://sites.google.com/site/visuomotorpolicy 
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the gripper, the coat hanger was placed on a rack with clothes, and the bottle and hammer 
tasks were done in the presence of clutter. Illustrations of this test are shown in Figure 9. 

Comparison. The success rates for each test are shown in Figure 9. We compared to 
two baselines, both of which train the vision layers in advance for pose prediction, instead 
of training the entire policy end-to-end. The features baseline discards the last layer of the 
pose predictor and uses the feature points, resulting in the same architecture as our policy, 
while the prediction baseline feeds the predicted pose into the control layers. The pose 
prediction baseline is analogous to a standard modular approach to policy learning, where 
the vision system is first trained to localize the target, and the policy is trained on top of it. 
This variant achieves poor performance. As discussed in Section 6.3, the pose estimate is 
accurate to about 1 cm. However, unlike the tasks in Section 6.2, where robust controllers 
could succeed even with inaccurate perception, many of these tasks have tolerances of just 
a few millimeters. In fact, the pose prediction baseline is only successful on the coat hanger, 
which requires comparatively little accuracy. Millimeter accuracy is difficult to achieve even 
with calibrated cameras and checkerboards. Indeed, prior work has reported that the PR2 
can maintain a camera to end effector accuracy of about 2 cm during open loop motion 
(Meeussen et ah, 2010). This suggests that the failure of this baseline is not atypical, 
and that our visuomotor policies are learning visual features and control strategies that 
improve the robot’s accuracy. When provided with pose estimation features, the policy 
has more freedom in how it uses the visual information, and achieves somewhat higher 
success rates. However, full end-to-end training performs significantly better, achieving 
high accuracy even on the challenging bottle task, and successfully adapting to the variety 
of grasps on the hammer task. This suggests that, although the vision layer pretraining is 
clearly beneficial for reducing computation time, it is not sufficient by itself for discovering 
good features for visuomotor policies. 

Visual distractors. The policies exhibit moderate tolerance to distractors that are visu¬ 
ally separated from the target object. This is enabled in part by the spatial softmax, which 
has a lateral inhibition effect that suppresses non-maximal activations. Since distractors 
are unlikely to activate each feature as much as the true object, their activations are there¬ 
fore suppressed. However, as expected, the learned policies tend to perform poorly under 
drastic changes to the backdrop, or when the distractors are adjacent to or occluding the 
manipulated objects, as shown in the supplementary video. A standard solution to this 
issue to expose the policy to a greater variety of visual situations during training. This 
issue could also be mitigated by artificially augmenting the image samples with synthetic 
transformations, as discussed in prior work in computer vision (Simard et ah, 2003), or even 
incorporating ideas from transfer and semi-supervised learning. 

6.5 Features Learned with End-to-End Training 

The visual processing layers of our architecture automatically learn features points using 
the spatial softmax and expectation operators. These feature points encapsulate all of the 
visual information received by the motor layers of the policy. In Figure 10, we show the 
features points discovered by our visuomotor policy through guided policy search. Each 
policy learns features on the target object and the robot manipulator, both clearly relevant 
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training visual test 



coat hanger 

training (18) 

spatial test (24) 

visual test (18) 

end-to-end 

100% 

100% 

100% 

pose features 

88.9% 

87.5% 

83.3% 

pose prediction 

55.6% 

58.3% 

66.7% 

shape cube 

training (27) 

spatial test (36) 

visual test (40) 

end-to-end 

96.3% 

91.7% 

87.5% 

pose features 

70.4% 

83.3% 

40% 

pose prediction 

0% 

0% 

n/a 

toy hammer 

training (45) 

spatial test (60) 

visual test (60) 

end-to-end 

91.1% 

86.7% 

78.3% 

pose features 

62.2% 

75.0% 

53.3% 

pose prediction 

8.9% 

18.3% 

n/a 

bottle cap 

training (27) 

spatial test (12) 

visual test (40) 

end-to-end 

88.9% 

83.3% 

62.5% 

pose features 

55.6% 

58.3% 

27.5% 


Success rates on training positions, on novel test positions, and 
in the presence of visual distractors. The number of trials per 
test is shown in parentheses. 


Figure 9: Training and visual test scenes as seen by the policy (left), and experimental 
results (right). The hammer and bottle images were cropped for visualization only. 


to task execution. The policy tends to pick out robust, distinctive features on the objects, 
such as the left pole of the clothes rack, the left corners of the shape-sorting cube and 
the bottom-left corner of the toy tool bench. In the bottle task, the end-to-end trained 
policy outputs points on both sides of the bottle, including one on the cap, while the pose 
prediction network only finds points on the right edge of the bottle. 

In Figure 11, we compare the feature points learned through guided policy search to 
those learned by a CNN trained for pose prediction. After end-to-end training, the policy 
acquired a distinctly different set of feature points compared to the pose prediction CNN 
used for initialization. The end-to-end trained model finds more feature points on task¬ 
relevant objects and fewer points on background objects. This suggests that the policy 
improves its performance by acquiring goal-driven visual features that differ from those 
learned for object localization. 

The feature point representation is very simple, since it assumes that the learned features 
are present at all times, and only one instance of each feature is ever present in the image. 
While this is a drastic simplification, both the pose predictor and the policy still achieve good 
results. A more flexible architecture that still learns a concise feature point representation 
could further improve policy performance. We hope to explore this in future work. 


6.6 Computational Performance and Sample Efficiency 

We used the Caffe deep learning library (Jia et al., 2014) for CNN training. Each visuomotor 
policy required a total of 3-4 hours of training time: 20-30 minutes for the pose prediction 
data collection on the robot, 40-60 minutes for the fully observed trajectory pretraining on 
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(a) hanger (b) cube (c) hammer (d) bottle 

Figure 10: Feature points tracked by the policy during task execution for each of the four 
tasks. Each feature point is displayed in a different random color, with consistent coloring 
across images. The policy finds features on the target object and the robot gripper and 
arm. In the bottle cap task, note that the policy correctly ignores the distractor bottle in 
the background, even though it was not present during training. 



(a) hanger (b) cube (c) hammer (d) bottle 


Figure 11: Feature points learned for each task. For each input image, the feature points 
produced by the policy are shown in blue, while the feature points of the pose prediction 
network are shown in red. The end-to-end trained policy tends to discover more feature 
points on the target object and the robot arm than the pose prediction network. 
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the robot and offline pose pretraining (which can be done in parallel), and between 1.5 and 
2.5 hours for end-to-end training with guided policy search. The coat hanger task required 
two iterations of guided policy search, the shape sorting cube and the hammer required 
three, and the bottle task required four. Only about 15 minutes of the training time 
consisted of executing trials on the robot. Since training was dominated by computation, 
we expect significant speedup from a more efficient implementation. The number of samples 
for training each policy is shown in Table 4. Each trial was five seconds in length, and the 
numbers do not include the time needed to collect about 1000 images for pretraining the 
visual processing layers of the policy. 



number of trials 

task 

trajectory pretraining 

end-to-end training 

total 

coat hanger 

120 

36 

156 

shape cube 

90 

81 

171 

toy hammer 

150 

90 

240 

bottle cap 

180 

108 

288 


Table 4: Total number of trials used for learning each visuomotor policy. 

7. Discussion and Future Work 

In this paper, we presented a method for learning robotic control policies that use raw 
input from a monocular camera. These policies are represented by a novel convolutional 
neural network architecture, and can be trained end-to-end using our guided policy search 
algorithm, which decomposes the policy search problem in a trajectory optimization phase 
that uses full state information and a supervised learning phase that only uses the obser¬ 
vations. This decomposition allows us to leverage state-of-the-art tools from supervised 
learning, making it straightforward to optimize extremely high-dimensional policies. Our 
experimental results show that our method can execute complex manipulation skills, and 
that end-to-end training produces significant improvements in policy performance compared 
to using fixed vision layers trained for pose prediction. 

Although we demonstrate moderate generalization over variations in the scene, our 
current method does not generalize to dramatically different settings, especially when visual 
distractors occlude the manipulated object or break up its silhouette in ways that differ from 
the training. The success of CNNs on exceedingly challenging vision tasks suggests that 
this class of models is capable of learning invariance to irrelevant distractor features (LeCun 
et ah, 2015), and in principle this issue can be addressed by training the policy in a variety 
of environments, though this poses certain logistical challenges. More practical alternatives 
that could be explored in future work include simultaneously training the policy on multiple 
robots, each of which is located in a different environment, developing more sophisticated 
regularization and pretraining techniques to avoid overfitting, and introducing artificial 
data augmentation to encourage the policy to be invariant to irrelevant clutter. However, 
even without these improvements, our method has numerous applications in, for example, 
an industrial setting where the robot must repeatedly and efficiently perform a task that 
requires visual feedback under moderate variation in background and clutter conditions. 

Our method takes advantage of a known, fully observed state space during training. 
This is both a weakness and a strength. It allows us to train linear-Gaussian controllers 
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for guided policy search using a very small number of samples, far more efficiently than 
standard policy search methods. However, the requirement to observe the full state during 
training limits the tasks to which the method can be applied. In many cases, this limitation 
is minor, and the only “instrumentation” required at training is to position the objects in 
the scene at consistent positions. However, tasks that require, for example, manipulating 
freely moving objects require more extensive instrumentation, such as motion capture. A 
promising direction for addressing this limitation is to combine our method with unsuper¬ 
vised state-space learning, as proposed in several recent works, including our own (Lange 
et ah, 2012; Watter et ah, 2015; Finn et ah, 2015). 

In future work, we hope to explore more complex policy architectures, such as recurrent 
policies that can deal with extensive occlusions by keeping a memory of past observations. 
We also hope to extend our method to a wider range of tasks that can benefit from visual 
input, as well as a variety of other rich sensory modalities, including haptic input from 
pressure sensors and auditory input. With a wider range of sensory modalities, end-to- 
end training of sensorimotor policies will become increasingly important: while it is often 
straightforward to imagine how vision might help to localize the position of an object in 
the scene, it is much less apparent how sound can be integrated into robotic control. A 
learned sensorimotor policy would be able to naturally integrate a wide range of modalities 
and utilize them to directly aid in control. 
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Appendix A. Guided Policy Search Algorithm Details 

In this appendix, we describe a number of implementation details of our BADMM-based 
guided policy search algorithm and our linear-Gaussian controller optimization method. 

A.l BADMM Dual Variables and Weight Adjustment 

Recall that the inner loop alternating optimization is given by 

T 

6 <r- arg nun E ^p(xt)7T0(ut|xt) [ U £ / W] 

t= 1 
T 

p <— arg min E ,U t) u t ) — u JX^tl + vt^iP, 0) 

P t= 1 

^llt + OiUf (^7T0(ut|xt)p(xt) [ U ^] _ |xt)p(xt) [ U t])* 

We use a step size of a = 0.1 in all of our experiments, which we found to be more 
stable than a = 1.0. The weights vt are initialized to 0.01 and incremented based on 
the following schedule: at every iteration, we compute the average KL-divergence between 
p(ut\x t ) and 7T0(ut|xt) at each time step, as well as its standard deviation over time steps. 
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The weights vt corresponding to time steps where the KL-divergence is higher than the 
average are increased by a factor of 2, and the weights corresponding to time steps where 
the KL-divergence is two standard deviations or more below the average are decreased by 
a factor of 2. The rationale behind this schedule is to adjust the KL-divergence penalty 
to keep the policy and trajectory in agreement by roughly the same amount at all time 
steps. Increasing vt too quickly can lead to the policy and trajectory becoming “locked” 
together, which makes it difficult for the trajectory to decrease its cost, while leaving it too 
low requires more iterations for convergence. We found this schedule to work well across 
all tasks, both during trajectory pretraining and while training the visuomotor policy. 

To update the dual variables Awe evaluate the expectations over p(x*) by using the 
latest batch of sampled trajectories. For each state {x£} along these sampled trajectories, 
we evaluate the expectations over ip under ^(u^x*) and p(ut|x^), which correspond simply 
to the means of these conditional Gaussian distributions, in closed form. 


A.2 Policy Variance Optimization 

As discussed in Section 4, the variance of the Gaussian policy 7ro(ut\ot) does not depend on 
the observation, though this dependence would be straightforward to add. Analyzing the 
objective we can write out only the terms that depend on XT: 

N T 

4>(M = 5jv E E £ p.(*.«> [“'[OS'S"] - log |S*|]. 

2=1 t— 1 

Differentiating and setting the derivative to zero, we obtain the following equation for X 71 ": 


X^ = 


N T 

jt EE c s‘ 


n -1 


NT 


2=1 t= 1 

where the expectation under Piiptt) is omitted, since C a does not depend on x^. 


A.3 Dynamics Fitting 

Optimizing the linear-Gaussian controllers ^(u^x^) that induce the trajectory distributions 
Pi(r) requires fitting the system dynamics p^(x^ + i|x^, u^) at each iteration to samples gen¬ 
erated on the physical system from the previous controller ^(u^x^). In this section, we 
describe how these dynamics are fitted. As in Section 4, we drop the subscript i, since the 
dynamics are fitted the same way for all of the trajectory distributions. 

The linear-Gaussian dynamics are defined as p(x*+i|x*, u t ) — + fut u t + feu F^), 

and the data that we obtain from the robot can be viewed as tuples {xj, uj, xj +1 }. A 
simple way to fit these linear-Gaussian dynamics is to use linear regression to determine 
/x? /u? and / c , and fit based on the errors. However, the sample complexity of linear 
regression scales with the dimensionality of x^. For a high-dimensional robotic system, we 
might need an impractically large number of samples at each iteration to obtain a good fit. 
However, we can observe that the dynamics at nearby time steps are strongly correlated, 
and we can dramatically reduce the sample complexity of the dynamics fitting by bringing 
in information from other time steps, and even prior iterations. We will bring in this 
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information by fitting a global model to all of the transitions {x£, u|, x£ +1 } for all t and all 
tuples from several prior iterations (we use three prior iterations in our implementation), 
and then use this model as a prior for fitting the dynamics at each time step. Note that 
this global model does not itself need to be a good forward dynamics model - it just needs 
to serve as a good prior to reduce the sample complexity of linear regression. 

To make it more convenient to incorporate a data-driven prior, we will first reformulate 
this linear regression fit and view it as fitting a Gaussian model to the dataset {xj, uj, xj +1 } 
at each time step t, and then conditioning this Gaussian to obtain p(x* + i|x£, u^). While 
this is equivalent to linear regression, it allows us to easily incorporate a normal-inverse- 
Wishart prior on this Gaussian in order to bring in prior information. Let E be the empirical 
covariance of our dataset, and let fi be the empirical mean. The normal-inverse-Wishart 
prior is defined by prior parameters /io, and no- Under this prior, the maximum a 
posteriori estimates for the covariance E and mean fi are given by 

3> + NT, + - /^o)(A “ W)) T muQ + nofi 

£ =-V;- M=-• 

+ no m + tiq 

Having obtained E and /i, we can obtain an estimate of the dynamics p(x^ + i|x^, u t ) by 
conditioning the distribution A /"(/i, E) on [x*; u*], which produces linear-Gaussian dynamics 
p(x^ + i|x^, Ut) = A/^/xtXt + fut u t + fct , Ft)- The parameters of the normal-inverse-Wishart 
prior are obtained from the global model of the dynamics which, as described previously, is 
fitted to all available tuples {x£, uj, x£ +1 }. 

The simplest prior can be obtained by fitting a Gaussian distribution to vectors [x; u; x']. 
If the mean and covariance of this data are given by fl and E, the prior is given by <l> = noE 
and fi o = /2, while no and m should be set to the number of data points in the datasets. In 
practice, settings no and m to 1 tends to produce better results, since the prior is fitted to 
many more samples than are available for linear regression at each time step. While this 
prior is simple, we can obtain a better prior by employing a nonlinear model. 

The particular global model we use in this work is a Gaussian mixture model over 
vectors [x;u;x 7 ]. Systems of articulated rigid bodies undergoing contact dynamics, such 
as robots interacting with their environment, can be coarsely modeled as having piecewise 
linear dynamics. The Gaussian mixture model provides a good approximation for such 
piecewise linear systems, with each mixture element corresponding to a different linear mode 
(Khansari-Zadeh and Billard, 2010). Under this model, the state transition tuple is assumed 
to come from a distribution that depends on some hidden state /i, which corresponds to 
the mixture element identity. In practice, this hidden state might correspond to the type of 
contact profile experienced by a robotic arm at step i. The prior for the dynamics fit at time 
step t is then obtained by inferring the hidden state distribution for the transition dataset 
{xj, uj, x^ +1 }, and using the mean and covariance of the corresponding mixture elements 
(weighted by their probabilities) to obtain fl and E. The prior parameters can then be 
obtained as described above. 

In our experiments, we set the number of mixture elements for the Gaussian mixture 
model prior such that there were at least 40 samples per mixture element, or 20 total 
mixture elements, whichever was lower. In general, we did not find the performance of the 
method to be sensitive to this parameter, though overfitting did tend to occur in the early 
iterations when the number of samples is low, if the number of mixtures was too high. 
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A.4 Trajectory Optimization 

In this section, we show how the LQR backward pass can be used to optimize the constrained 
objective in Section 4.2, The constrained trajectory optimization problem is given by 

, mm C p (p, 0) s.t. D kl ( P (t)\\p(t)) < e. 

The augmented Lagrangian £ p (p,9) consists of an entropy term and an expectation under 
p(r) of a quantity that is independent of p. We can locally approximate this quantity with 
a quadratic by using a quadratic expansion of and fitting a linear Gaussian to 

7Vo(u t \^t) with the same method we used for the dynamics. We can then solve the primal 
optimization in the dual gradient descent procedure with a standard LQR backward pass. 
As discussed in Section 4, C p (p , 0) can be written as the expectation of some function c(r) 
that is independent of p, such that C p {p , 6) = E p ^[c(r)\ — Specifically, 

c(x t ,u t ) = £{x t ,u t ) - uJXpt - ^logvr 0 (u t |x t ) 

Writing the Lagrangian of the constrained optimization, we have 


£(p) = E p{r)[c{r ) - i?log p(t)} - (r? + v t )H{p{T)) - r/e, 

where p is the Lagrange multiplier. Note that C(p) is the Lagrangian of the constrained 
trajectory optimization, which is not related to the augmented Lagrangian C p {r , 6). Group¬ 
ing the terms in the expectation and omitting constants, we can rewrite the minimization 
of the Lagrangian with respect to the primal variables as 


mm 

p(r)eM(r) 


E, 


p( t ) 


1 


L v + vt 


c(r)- 


V + vt 


log p(t) 


n(p(T)). 


(4) 


Let c(t) = ^4_c(rj — yy logp(r). The above optimization corresponds to minimizing 
E p ( r )[c( T )] ~ 7-L(p(r)). This type of maximum entropy problem can be solved using the 
LQR algorithm, and the solution is given by 


p(u t |x t ) = V(K t x t + kt; Q u l ut ), 


where K t and are the feedback and open loop terms of the optimal linear feedback 
controller corresponding to the cost c(xt,u^) and the dynamics p(xt+i|x^, uQ, and Q u ,ut is 
the quadratic term in the Q-function at time step t. All of these terms can be obtained 
from a standard LQR backward pass (Li and Todorov, 2004), which we summarize below. 

Recall that the estimated linear-Gaussian dynamics have the form p(x*_|_i|x*, u*) = 
•A/"(/xtXt + futW-t + fct, F t ). The quadratic cost approximation has the form 

c(xt, Ut) ~ i[x t ; Ut] T c XUjXU j [xjj u t ] + [x t ; Uj] T c xu t + const, 

where subscripts denote derivatives, e.g. c xu t is the gradient of c with respect to [x^;ut], 
while c xu ,xut is the Hessian. 4 Under this model of the dynamics and cost function, the 

4. We assume that all Taylor expansions here are recentered around zero. Otherwise, the point around 
which the derivatives are computed must be subtracted from x t and u t in all of these equations. 
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optimal controller can be computed by recursively computing the quadratic Q-function and 
value function, starting with the last time step. These functions are given by 

V(x t ) = ^x^WxtXt + xJV xt + const 

Q(x. t , ut) = ^[x t ; ut] T <3 XUjXU t[xt; u t ] + [x t ; u t ] T Q xxl t + const 

We can express them with the following recurrence, which is computed starting at the last 
time step t — T and moving backward through time: 

Qxu,xut C X u,XUt T /xut^x,xt+l/xut 

Qxut Oxut T /xu tVxt+1 + /xut^x,xt+l/ct 

^x,xt — Qx,xt — Q u,xtQ u,utQ 
Vyit — Qxt — Qu,xtQu,utQuO 

and the optimal control law is then given by g(x^) — + k^, where K t — — LQ u ,xt 

and = — Qu^utQut- If 5 instead of simply minimizing the expected cost, we instead wish to 
optimize the maximum entropy objective in Equation (4), the optimal controller is instead 
linear-Gaussian, with the solution given by p(ut\^t) — A/^Ktx^ + kt;Qu,ut)? as shown in 
prior work (Levine and Koltun, 2013a). 

Appendix B. Experimental Setup Details 

In this appendix, we present a detailed summary of the experimental setup for our simulated 
and real-world experiments. 

B.l Simulated Experiment Details 

All of the simulated experiments used the MuJoCo simulation package (Todorov et ah, 
2012), with simulated frictional contacts and torque motors at the joints used for actuation. 
Although no control or state noise was added during simulation, noise was injected naturally 
by the linear-Gaussian controllers. The linear-Gaussian controllers p^(ut|x^) were initialized 
to stay near the initial state xi using linear feedback based on a proportional-derivative 
control law for all tasks, except for the octopus arm, where ^(ut|x^) was initialized to be 
zero mean with a fixed spherical covariance, and the walker, which was initialized to track 
a demonstration trajectory with proportional-derivative feedback. The walker was the only 
task that used a demonstration, as described previously. We describe the details of each 
system below. 

Peg insertion: The 2D peg insertion task has 6 state dimensions (joint angles and angular 
velocities) and 2 action dimensions. The 3D version of the task has 12 state dimensions, 
since the arm has 3 degrees of freedom at the shoulder, 1 at the elbow, and 2 at the wrist. 
Trials were 8 seconds in length and simulated at 100 Hz, resulting in 800 time steps per 
rollout. The cost function is given by 

^(x t , Ut) = ^w u ||ut|| 2 + Wpi 12 (p Xt - p*), 
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where p Xt is the position of the end effector for state x*, p* is the desired end effector 
position at the bottom of the slot, and the norm £u(z) is given by ^||z|| 2 + Va? + z 2 , which 
corresponds to the sum of an £2 and soft l\ norm. We use this norm to encourage the peg 
to precisely reach the target position at the bottom of the hole, but to also receive a larger 
penalty when far away. The task also works well in 2D with a simple £2 penalty, though 
we found that the 3D version of the task takes longer to insert the peg all the way into the 
hole without the £i~like square root term. The weights were set to w u = 10 -6 and w p = 1. 
Initial states were chosen by moving the shoulder of the arm relative to the hole, with four 
equally spaced starting states in a 20 cm region for the 2D arm, and four random starting 
states in a 10 cm radius for the 3D arm. 


Octopus arm: The octopus arm consists of six four-sided chambers. Each edge of each 
chamber is a simulated muscle, and actions correspond to contracting or relaxing the mus¬ 
cle. The state space consists of the positions and velocities of the chamber vertices. The 
midpoint of one edge of the first chamber is fixed, resulting in a total of 25 degrees of free¬ 
dom: the 2D positions of the 12 unconstrained points, and the orientation of the first edge. 
Including velocities, the total dimensionality of the state space is 50. The cost function 
depends on the activation of the muscles and distance between the tip of the arm and the 
target point, in the same way as for peg insertion. The weights are set to w u = 10 -3 and 
w p = 1. 

Swimmer: The swimmer consists of 3 links and 5 degrees of freedom, including the global 
position and orientation which, together with the velocities, produces a 10 dimensional state 
space. The swimmer has 2 action dimensions corresponding to the torques between joints. 
The simulation applied drag on each link of the swimmer to roughly simulate a fluid, allowing 
it to propel itself. The rollouts were 20 seconds in length at 20 Hz, resulting in 400 time 
steps per rollout. The cost function for the swimmer is given by 

£(xi,u t ) = ^WuIKH 4 + ^w v \\v Xxt - v* x \\ 2 

where v XXt is the horizontal velocity, v* — 2.0m/s, and the weights were w u = 2 • 10 -5 and 
w v = 1. 

Walker: The bipedal walker consists of a torso and two legs, each with three links, for 
a total of 9 degrees of freedom and 18 dimensions, with velocity, and 6 action dimensions. 
The simulation ran for 5 seconds at 100 Hz, for a total of 500 time steps. The cost function 
is given by 

£(x t ,U t ) = twulKH* + \w V \\v X * t ~ K\\ 2 + \ w h\\Py Xt ~P * y \\ 2 

where v XXt is again the horizontal velocity, p Vy , t is the vertical position of the root, v* = 
2.1m/s, Py ~ 1.1m, and the weights were set to w u = 10 -4 , w v — 1, and Wh = 1. 
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B.2 Robotic Experiment Details 

All of the robotic experiments were conducted on a PR2 robot. The robot was controlled 
at 20 Hz via direct effort control. 5 and camera images were recorded using the RGB camera 
on a PrimeSense Carmine sensor. The images were downsampled to 240 x 240 x 3. The 
learned policies controlled one 7 DoF arm of the robot, while the other arm was used to 
move objects in the scene to automatically vary the initial conditions. The camera was 
kept fixed in each experiment. Each episode was 5 seconds in length. For each task, the 
cost function required placing the object held in the gripper at a particular location (which 
might require, for example, to insert a shape into a shape sorting cube). The cost was given 
by the following equation: 

£(xt,u t ) = Wi 2 cP t + wi og log(dt + a) + w u |K|| 2 , 

where dt is the distance between three points in the space of the end-effector and their 
target positions. 6 , and the weights are set to W£ 2 — 10 -3 , w\ og — 1.0, and w u — 10 -2 . The 
quadratic term encourages moving the end-effector toward the target when it is far, while the 
logarithm term encourages placing it precisely at the target location, as discussed in prior 
work (Levine et ah, 2015). The bottle cap task used an additional cost term consisting of a 
quadratic penalty on the difference between the wrist angular velocity and a target velocity. 

For all of the tasks, we initialized all of the linear-Gaussian controllers pi(u*|x*) to stay 
near the initial state xi, with a diagonal noise covariance. The covariance of the noise was 
chosen to be proportional to a diagonal approximation of the inverse effective mass at each 
joint, as provided by the manufacturer of the PR2 robot, and the feedback controller was 
constructed using LQR, with an approximate linear model obtained from the same diagonal 
inverse mass matrix. The role of this initial controller was primarily to avoid dangerous 
actions during the first iteration. We discuss the particular setup for each experiment below: 

Coat hanger: The coat hanger task required the robot to hang a coat hanger on a clothes 
rack. The coat hanger was grasped at one of two angles, about 35° apart, and the rack was 
positioned at three different distances from the robot during training, with differences of 
about 10 cm between each position. The rack was moved manually between these positions 
during training. A trial was considered successful if, when the coat hanger was released, it 
remained hanging on the rack rather than dropping to the ground. 

Shape sorting cube: The shape sorting cube task required the robot to insert a red 
trapezoid into a trapezoidal hole on a shape sorting cube. During training, the cube was 
positioned at nine different positions, situated at the corners, edges, and middle of a rect¬ 
angular region 16 cm x 10 cm in size. During training, the shape sorting cube was moved 
through the training positions by using the left arm. A trial was considered successful if 
the bottom face of the trapezoid was completely inside the shape sorting cube, such that if 
the robot were to release the trapezoid, it would fall inside the cube. 

5. The PR2 robot does not provide for closed loop torque control, but instead supports an effort control in¬ 
terface that directly sets feedforward motor voltages. In practice, these voltages are roughly proportional 
to feedforward torques, but are also affected by friction and damping. 

6. Three points fully define the pose of the end-effector. For the bottle cap task, which is radially symmetric, 
we use only two points. 
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Toy hammer: The hammer task required the robot to insert the claw of a toy hammer 
underneath a toy plastic nail, placing the claw around the base of the nail. The hammer 
was grasped at one of three angles, each 22.5° apart, for a total variation of 45° degrees, and 
the nail was positioned at five positions, at the corners and center of a rectangular region 
10 cm x 7 cm in size. During training, the toy tool bench containing the nail was moved 
using the left arm. A trial was considered successful if the tip of the claw of the hammer 
was at least under the centerline of the nail. 

Bottle cap: The bottle cap task required the robot to screw a cap onto a bottle at various 
positions. The bottle was located at nine different positions, situated at the corners, edges, 
and middle of a rectangular region 16 cm x 10 cm in size, and the left arm was used to 
move the bottle through the training positions. A trial was considered successful if, after 
completion, the cap could not be removed from bottle simply by pulling vertically. 
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